-
more conservative v0.1.0 annotation
aquacul4
running
Also will attempt to clean up first blast output (~58k hits)
> removing alnlength <100
> removing e-value < E-10
> sorted by query, then evalue; then removed duplicates on query and query start column (reduced to 5786)
> repeated with query end column (reduced to 3891)
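The cleanup steps above could be sketched roughly as follows. This is only a sketch, not what was actually run. It assumes the blast output is the standard 12-column tabular layout (alignment length in col 4, query start/end in cols 7/8, evalue in col 11) and that the evalue step means keeping hits at or below 1e-10. The function name clean_hits is made up for illustration.

```python
def clean_hits(rows, min_len=100, max_evalue=1e-10):
    """Filter and de-duplicate BLAST tabular hits.

    rows: list of 12-field records (lists of strings), assumed to be in
    the standard tabular column order. Thresholds are assumptions.
    """
    # Drop short alignments and weak e-values.
    kept = [r for r in rows
            if int(r[3]) >= min_len and float(r[10]) <= max_evalue]
    # Sort by query, then e-value, so the best hit per query comes first.
    kept.sort(key=lambda r: (r[0], float(r[10])))
    # Remove duplicates on (query, query start), then on (query, query end).
    for col in (6, 7):
        seen = set()
        unique = []
        for r in kept:
            key = (r[0], r[col])
            if key not in seen:
                seen.add(key)
                unique.append(r)
        kept = unique
    return kept
```

Because the sort puts the lowest evalue first, each duplicate pass keeps the best-scoring hit for a given query position.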
Then going with annotation on Galaxy
new GFF
Blast2gff code looks something like this
./Blast2Gff.pl -i /Volumes/Bay3\ scratch/gff_fun/7 -o /Volumes/Bay3\ scratch/gff_fun/Combined_fosmids_cd_hit_mod_20000_7trim.gff -d "sigenae_v8" -p EXON -s "something"
Flipping it around and taking sigenae v8 and blasting tgagag v0.1.0 on Server (SW) using -G 1 and -E 1
Will modify the Blast2gff script to try to pull out relevant information.
BLAST COMPLETE
7.4 Million lines
Will use Galaxy to filter…
align length >100
;; down to 43,061 lines..
from original
evalue < 0.01
;; down to 110,000 lines
from original
evalue < 0.0001
;; down to 65,842 lines
from there, trimming to align length > 100
;; now at 37,274 lines
---------
Back to original
trimming on %ID
c3>=95
about a million lines
---
Now will filter
the 37,274-line file (evalue, align length)
with
c3>=95
1105 lines
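For reference, the same chain of filters (evalue < 0.0001, align length > 100, c3>=95) could be reproduced outside Galaxy with something like the sketch below. It assumes tab-separated blast output where c3 is %ID (as above), c4 is alignment length and c11 is evalue; the latter two column positions are an assumption based on the standard tabular layout. The function name galaxy_style_filter is made up.

```python
def galaxy_style_filter(lines, min_pid=95.0, max_evalue=1e-4, min_len=100):
    """Keep tab-separated BLAST lines passing all three thresholds.

    Column positions (1-based, Galaxy style): c3 = %ID, c4 = alignment
    length, c11 = evalue. These positions are assumed, not confirmed.
    """
    kept = []
    for line in lines:
        c = line.rstrip("\n").split("\t")
        pid = float(c[2])      # c3
        aln_len = int(c[3])    # c4
        evalue = float(c[10])  # c11
        if pid >= min_pid and evalue < max_evalue and aln_len > min_len:
            kept.append(line)
    return kept
```

Applying all three conditions in one pass gives the same result as filtering in sequence, since each condition only removes lines.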
NEW GFF file
also known as Annotation2_cg_v010
NOTE: col 9 of the GFF needs to indicate "name="
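A rough sketch of what a GFF line with "name=" in col 9 might look like when built from one blast tabular hit. This is not the Blast2Gff.pl logic. It assumes the subject is the reference sequence (GFF col 1), the query ID goes into "name=", and the bit score is used as the GFF score. The source and feature defaults just reuse the values passed to -d and -p in the command above; how Blast2Gff.pl actually uses those flags is not confirmed here. The function name hit_to_gff is made up.

```python
def hit_to_gff(row, source="sigenae_v8", feature="EXON"):
    """Turn one 12-field BLAST tabular record into a 9-column GFF line.

    Assumptions: subject = reference sequence, query ID becomes the
    name= attribute, bit score becomes the GFF score column.
    """
    query, subject = row[0], row[1]
    sstart, send = int(row[8]), int(row[9])
    # GFF requires start <= end; reversed subject coordinates mean minus strand.
    if sstart <= send:
        start, end, strand = sstart, send, "+"
    else:
        start, end, strand = send, sstart, "-"
    fields = [subject, source, feature, str(start), str(end),
              row[11], strand, ".", "name=" + query]
    return "\t".join(fields)
```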
--
running an MBD ref map on it.
--
IDEA: need to get a known gene structure and validate an approach.